When building a predictive model, you are usually interested not only in how well the model predicts, but also in which features/predictors are important. Eliminating useless predictors simplifies your model and can even improve it by reducing noise.
The first, and simplest, way to go about this is to eliminate predictors with variance below some threshold. This isn't very robust; a minimal sketch is given below.
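As a rough illustration, scikit-learn's VarianceThreshold drops any feature whose variance does not exceed a cutoff. The toy data and the 0.1 cutoff here are arbitrary choices for demonstration, not part of the analysis that follows.
In [ ]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the second column is nearly constant, so it gets dropped
X_toy = np.array([[1.0, 0.00],
                  [2.0, 0.00],
                  [3.0, 0.01],
                  [4.0, 0.00]])

selector = VarianceThreshold(threshold=0.1)   # remove features with variance <= 0.1
X_reduced = selector.fit_transform(X_toy)
print(selector.get_support(indices=True))     # -> [0]; only the first column survives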
In [1]:
from IPython.core.pylabtools import figsize
import numpy as np
from sklearn.datasets import make_regression
from statsmodels.api import OLS
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, LassoCV
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.feature_selection import f_regression
%matplotlib inline
plt.style.use('bmh')
In [2]:
X, y = make_regression(n_samples=1000, n_features=50, n_informative=3, tail_strength=1, random_state=0)
The next (and the traditional) approach: make the requisite assumptions, then do inference based on $t$-statistics and $p$-values. To do this properly, we should follow up with forward or backward stepwise regression (ideally every possible subset of predictors would be tested; stepwise regression is a greedy approximation that adds or removes one predictor at a time). This is more robust than simply eliminating low-variance predictors, but it can become impractical when the number of predictors is large. A minimal backward-elimination sketch follows the OLS summary below.
In [3]:
model = OLS(y, X)
results = model.fit()
print(results.summary())
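The summary above reports a $t$-statistic and $p$-value for each coefficient. statsmodels has no built-in stepwise routine, so the following is a minimal backward-elimination sketch, assuming a conventional 0.05 significance threshold: drop the least significant predictor, refit, and repeat until every remaining predictor is significant.
In [ ]:
def backward_eliminate(X, y, threshold=0.05):
    """Greedily drop the least significant predictor until all p-values are below threshold."""
    cols = list(range(X.shape[1]))
    while cols:
        pvalues = OLS(y, X[:, cols]).fit().pvalues
        worst = int(np.argmax(pvalues))
        if pvalues[worst] < threshold:
            break            # every remaining predictor is significant
        del cols[worst]      # drop the weakest predictor and refit
    return cols              # indices of the surviving predictors

print(backward_eliminate(X, y))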
Scikit-learn implements an $F$-test to assess the significance of each predictor considered individually. The null hypothesis is that a model built on that predictor alone is no better than a null model; the alternative is that it is better. This can be combined with utilities such as SelectKBest to quickly select features (see http://scikit-learn.org/stable/modules/feature_selection.html); a short example follows the cells below. However, on its own it doesn't support basic inference such as confidence intervals for those predictors.
In [4]:
scores = f_regression(X, y)
In [5]:
print(scores[0]) # F-statistics
print(np.argmin(scores[1])) # index of smallest p-value
In [6]:
np.argpartition(scores[1], 3)[:3] # returns indices of minimum 3 p-values in list; i.e. top 3 features
Out[6]:
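As mentioned above, the same $F$-test can drive SelectKBest directly. A minimal sketch, assuming we keep the top 3 features:
In [ ]:
from sklearn.feature_selection import SelectKBest

selector = SelectKBest(score_func=f_regression, k=3)   # rank features by F-statistic, keep the best 3
X_top3 = selector.fit_transform(X, y)
print(selector.get_support(indices=True))              # indices of the selected features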
I generally prefer $\ell_1$ regularization, which performs a 'continuous' version of stepwise variable selection, to traditional stepwise selection or sklearn's $F$-test. It needs cross-validation to tune effectively. The downside is that it often misses highly correlated features (typically it will latch onto one and drop the others). That likely won't affect the predictive quality of the final model, but it may affect the stability of the model when refit to different data.
In [7]:
# lasso = l1 regularized linear regression
lasso_model = LassoCV(n_jobs=-1, random_state=0, normalize=True, verbose=1, copy_X=True)
lasso_model.fit(X, y)
lasso_model.coef_
Out[7]:
In [8]:
[i for i, c in enumerate(lasso_model.coef_) if not np.isclose(c, 0.0)]  # indices of predictors with nonzero lasso coefficients
Out[8]:
In [9]:
eps = 5e-5 #alpha_min / alpha_max; smaller = longer path
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps)
figsize(12,6)
plt.plot(-np.log10(alphas_lasso), coefs_lasso.T)
plt.xlabel('-log10(alpha)')
plt.ylabel('coefficients');
A final approach, and another of my favorites: use random forest feature importances. This requires fitting the random forest in full, which can be computationally expensive.
In [10]:
rf_model = RF(n_jobs=-1, verbose=1, n_estimators=500)
rf_model.fit(X,y)
importance_scores = rf_model.feature_importances_
np.argpartition(importance_scores, -3)[-3:] # need the maximum values here
Out[10]:
There are many other methods. For a detailed list of options available in scikit-learn, look here: http://scikit-learn.org/stable/modules/feature_selection.html.
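For example, recursive feature elimination (RFE), covered on that page, repeatedly fits an estimator and discards the weakest features. A minimal sketch, assuming a plain linear model and 3 features to keep:
In [ ]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)   # drop one feature per iteration
rfe.fit(X, y)
print(np.where(rfe.support_)[0])                                  # indices of the selected features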
In [11]:
plt.scatter(importance_scores, range(len(importance_scores)))
plt.xlabel('Importance')
plt.ylabel('Variable')
plt.title('Random Forest Variable Importances');